Assignment #2 -- MicroService System Troubleshooting¶

Questions to Address:

In this assignment, we are required to explain the 5 failures listed in failures.json, using the information provided in test_data.

A failure example found in failures.json:

[
    2220,  # failure time, in seconds past 00:00
    [
        ["os_021", "Sent_queue"]
    ]  # root causes, each given as [cmdb_id, key_name]
]

In the above, something went wrong in the system at 00:37 (2220 seconds past midnight). We need to figure out the specific performance in the system that can indicate the failure and the corresponding root causes. For instance, we can give our answer from the following aspects:

  1. How does the ESB data behave abnormally?
  2. How do the root causes affect the related Trace and Host KPIs?

Beware of the components in a trace, as they may influence each other through their call relationships.
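Since failure times are given in seconds past midnight, a small helper (a sketch; the function name is our own) converts them to clock time for reporting:

```python
def seconds_to_clock(seconds: int) -> str:
    """Convert seconds past midnight to an HH:MM string."""
    return f"{seconds // 3600:02d}:{(seconds % 3600) // 60:02d}"

print(seconds_to_clock(2220))  # failure #1 -> "00:37"
print(seconds_to_clock(6960))  # failure #2 -> "01:56"
```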

Import libraries and dependencies¶

Data Exploration¶

Point #1: The processes can be deeply nested¶

Example:

cb48b17261295c467617 (37bd717261295c467616)
-> 8073017261295c527619 (37bd717261295c467616)
-> 62e7817261295c589816 (37bd717261295c467616)
-> ... (19 IDs)
-> ... (etc...)

Tip: Follow the `traceId` to obtain a group of related service calls

callType startTime msgTime elapsedTime success traceId id pid cmdb_id serviceName dsName
0 CSF 2020-05-30 00:00:00.082000+08:00 2020-05-30 00:00:00.968000+08:00 886 True 37bd717261295c467616 8073017261295c527619 cb48b17261295c467617 os_021 csf_001 NaN
4 CSF 2020-05-30 00:00:00.103000+08:00 2020-05-30 00:00:00.264000+08:00 160 True 37bd717261295c467616 fca3117261295c679826 62e7817261295c589816 docker_004 csf_002 NaN
10 CSF 2020-05-30 00:00:00.268000+08:00 2020-05-30 00:00:00.440000+08:00 171 True 37bd717261295c467616 15dcd17261295d0a9830 62e7817261295c589816 docker_004 csf_003 NaN
12 CSF 2020-05-30 00:00:00.441000+08:00 2020-05-30 00:00:00.559000+08:00 118 True 37bd717261295c467616 d2fdc17261295db99834 62e7817261295c589816 docker_004 csf_004 NaN
16 CSF 2020-05-30 00:00:00.561000+08:00 2020-05-30 00:00:00.614000+08:00 53 True 37bd717261295c467616 fe0fa17261295e319838 62e7817261295c589816 docker_004 csf_005 NaN
1706877 FlyRemote 2020-05-30 00:00:00.094000+08:00 2020-05-30 00:00:00.102000+08:00 8 True 37bd717261295c467616 1f47017261295c5e9822 62e7817261295c589816 docker_004 fly_remote_001 NaN
2050698 JDBC 2020-05-30 00:00:00.183000+08:00 2020-05-30 00:00:00.189000+08:00 6 True 37bd717261295c467616 b212a17261295cb74404 f65a417261295cb64401 docker_007 NaN db_003
2050700 JDBC 2020-05-30 00:00:00.192000+08:00 2020-05-30 00:00:00.252000+08:00 60 True 37bd717261295c467616 9b86517261295cc04407 f65a417261295cb64401 docker_007 NaN db_003
2050706 JDBC 2020-05-30 00:00:00.281000+08:00 2020-05-30 00:00:00.289000+08:00 8 True 37bd717261295c467616 a472c17261295d193626 ce7b117261295d193623 docker_008 NaN db_003
2050710 JDBC 2020-05-30 00:00:00.291000+08:00 2020-05-30 00:00:00.384000+08:00 93 True 37bd717261295c467616 54aa417261295d233629 ce7b117261295d193623 docker_008 NaN db_003
2050714 JDBC 2020-05-30 00:00:00.387000+08:00 2020-05-30 00:00:00.419000+08:00 32 True 37bd717261295c467616 5b62017261295d833634 e0dfa17261295d823631 docker_008 NaN db_003
2050718 JDBC 2020-05-30 00:00:00.448000+08:00 2020-05-30 00:00:00.453000+08:00 5 True 37bd717261295c467616 7a76117261295dc04492 5847c17261295dc04489 docker_007 NaN db_003
2050721 JDBC 2020-05-30 00:00:00.456000+08:00 2020-05-30 00:00:00.524000+08:00 68 True 37bd717261295c467616 73a0517261295dc84497 5847c17261295dc04489 docker_007 NaN db_003
2050726 JDBC 2020-05-30 00:00:00.529000+08:00 2020-05-30 00:00:00.538000+08:00 9 True 37bd717261295c467616 8ff5417261295e114514 0f31817261295e0f4509 docker_007 NaN db_003
2050730 JDBC 2020-05-30 00:00:00.574000+08:00 2020-05-30 00:00:00.583000+08:00 9 True 37bd717261295c467616 f65c017261295e3e4530 12b3917261295e3d4527 docker_007 NaN db_003
2050734 JDBC 2020-05-30 00:00:00.584000+08:00 2020-05-30 00:00:00.594000+08:00 10 True 37bd717261295c467616 af64017261295e484543 12b3917261295e3d4527 docker_007 NaN db_003
2050738 JDBC 2020-05-30 00:00:00.615000+08:00 2020-05-30 00:00:00.616000+08:00 1 True 37bd717261295c467616 ef07117261295e679851 e32fb17261295e679848 docker_004 NaN db_009
2050739 JDBC 2020-05-30 00:00:00.616000+08:00 2020-05-30 00:00:00.617000+08:00 1 True 37bd717261295c467616 a84b317261295e689854 e32fb17261295e679848 docker_004 NaN db_009
2050740 JDBC 2020-05-30 00:00:00.619000+08:00 2020-05-30 00:00:00.620000+08:00 1 True 37bd717261295c467616 bda4c17261295e6b9859 8282817261295e699856 docker_004 NaN db_009
2050743 JDBC 2020-05-30 00:00:00.622000+08:00 2020-05-30 00:00:00.622000+08:00 0 True 37bd717261295c467616 75a6917261295e6e9864 27dbc17261295e6c9861 docker_004 NaN db_009
2050746 JDBC 2020-05-30 00:00:00.624000+08:00 2020-05-30 00:00:00.625000+08:00 0 True 37bd717261295c467616 6e03d17261295e709869 bdd2a17261295e6f9866 docker_004 NaN db_009
2050747 JDBC 2020-05-30 00:00:00.626000+08:00 2020-05-30 00:00:00.627000+08:00 1 True 37bd717261295c467616 87d6417261295e729879 c7c0817261295e719873 docker_004 NaN db_009
2050753 JDBC 2020-05-30 00:00:00.630000+08:00 2020-05-30 00:00:00.630000+08:00 0 True 37bd717261295c467616 836f117261295e769892 91d0417261295e739886 docker_004 NaN db_009
2050759 JDBC 2020-05-30 00:00:00.633000+08:00 2020-05-30 00:00:00.633000+08:00 0 True 37bd717261295c467616 ba88317261295e799904 8145f17261295e779896 docker_004 NaN db_009
2050771 JDBC 2020-05-30 00:00:00.651000+08:00 2020-05-30 00:00:00.652000+08:00 1 True 37bd717261295c467616 0595717261295e8b9914 c96ae17261295e7a9909 docker_004 NaN db_009
2050774 JDBC 2020-05-30 00:00:00.655000+08:00 2020-05-30 00:00:00.656000+08:00 1 True 37bd717261295c467616 6696d17261295e8f9927 a8bc817261295e8d9919 docker_004 NaN db_009
2050775 JDBC 2020-05-30 00:00:00.658000+08:00 2020-05-30 00:00:00.658000+08:00 0 True 37bd717261295c467616 c49b317261295e929937 0efe517261295e909931 docker_004 NaN db_009
2050777 JDBC 2020-05-30 00:00:00.660000+08:00 2020-05-30 00:00:00.661000+08:00 1 True 37bd717261295c467616 fd96b17261295e949947 09fce17261295e939941 docker_004 NaN db_009
2050780 JDBC 2020-05-30 00:00:00.664000+08:00 2020-05-30 00:00:00.664000+08:00 0 True 37bd717261295c467616 0448517261295e989959 6bdec17261295e969954 docker_004 NaN db_009
2050782 JDBC 2020-05-30 00:00:00.666000+08:00 2020-05-30 00:00:00.666000+08:00 0 True 37bd717261295c467616 a39cc17261295e9a9969 f5a3e17261295e989964 docker_004 NaN db_009
2050784 JDBC 2020-05-30 00:00:00.667000+08:00 2020-05-30 00:00:00.670000+08:00 3 True 37bd717261295c467616 86e3c17261295e9b9977 79b8c17261295e9b9974 docker_004 NaN db_007
2050787 JDBC 2020-05-30 00:00:00.671000+08:00 2020-05-30 00:00:00.674000+08:00 3 True 37bd717261295c467616 90a8f17261295e9f9995 79b8c17261295e9b9974 docker_004 NaN db_007
10886645 LOCAL 2020-05-30 00:00:00.182000+08:00 2020-05-30 00:00:00.253000+08:00 71 True 37bd717261295c467616 f65a417261295cb64401 7bf1417261295c9b4394 docker_007 local_method_011 db_003
10886649 LOCAL 2020-05-30 00:00:00.281000+08:00 2020-05-30 00:00:00.385000+08:00 104 True 37bd717261295c467616 ce7b117261295d193623 5a67717261295d123619 docker_008 local_method_013 db_003
10886653 LOCAL 2020-05-30 00:00:00.386000+08:00 2020-05-30 00:00:00.419000+08:00 33 True 37bd717261295c467616 e0dfa17261295d823631 5a67717261295d123619 docker_008 local_method_014 db_003
10886657 LOCAL 2020-05-30 00:00:00.448000+08:00 2020-05-30 00:00:00.524000+08:00 76 True 37bd717261295c467616 5847c17261295dc04489 c9d2b17261295dbb4485 docker_007 local_method_015 db_003
10886661 LOCAL 2020-05-30 00:00:00.527000+08:00 2020-05-30 00:00:00.539000+08:00 12 True 37bd717261295c467616 0f31817261295e0f4509 c9d2b17261295dbb4485 docker_007 local_method_016 db_003
10886665 LOCAL 2020-05-30 00:00:00.574000+08:00 2020-05-30 00:00:00.594000+08:00 20 True 37bd717261295c467616 12b3917261295e3d4527 9ecec17261295e344519 docker_007 local_method_017 db_003
10886669 LOCAL 2020-05-30 00:00:00.615000+08:00 2020-05-30 00:00:00.617000+08:00 2 True 37bd717261295c467616 e32fb17261295e679848 62e7817261295c589816 docker_004 local_method_001 db_009
10886670 LOCAL 2020-05-30 00:00:00.617000+08:00 2020-05-30 00:00:00.620000+08:00 3 True 37bd717261295c467616 8282817261295e699856 62e7817261295c589816 docker_004 local_method_002 db_009
10886671 LOCAL 2020-05-30 00:00:00.620000+08:00 2020-05-30 00:00:00.623000+08:00 3 True 37bd717261295c467616 27dbc17261295e6c9861 62e7817261295c589816 docker_004 local_method_002 db_009
10886674 LOCAL 2020-05-30 00:00:00.623000+08:00 2020-05-30 00:00:00.625000+08:00 2 True 37bd717261295c467616 bdd2a17261295e6f9866 62e7817261295c589816 docker_004 local_method_003 db_009
10886677 LOCAL 2020-05-30 00:00:00.625000+08:00 2020-05-30 00:00:00.627000+08:00 2 True 37bd717261295c467616 c7c0817261295e719873 62e7817261295c589816 docker_004 local_method_004 db_009
10886679 LOCAL 2020-05-30 00:00:00.627000+08:00 2020-05-30 00:00:00.631000+08:00 4 True 37bd717261295c467616 91d0417261295e739886 62e7817261295c589816 docker_004 local_method_004 db_009
10886684 LOCAL 2020-05-30 00:00:00.631000+08:00 2020-05-30 00:00:00.634000+08:00 3 True 37bd717261295c467616 8145f17261295e779896 62e7817261295c589816 docker_004 local_method_004 db_009
10886690 LOCAL 2020-05-30 00:00:00.634000+08:00 2020-05-30 00:00:00.653000+08:00 19 True 37bd717261295c467616 c96ae17261295e7a9909 62e7817261295c589816 docker_004 local_method_004 db_009
10886702 LOCAL 2020-05-30 00:00:00.653000+08:00 2020-05-30 00:00:00.656000+08:00 3 True 37bd717261295c467616 a8bc817261295e8d9919 62e7817261295c589816 docker_004 local_method_005 db_009
10886705 LOCAL 2020-05-30 00:00:00.656000+08:00 2020-05-30 00:00:00.659000+08:00 3 True 37bd717261295c467616 0efe517261295e909931 62e7817261295c589816 docker_004 local_method_006 db_009
10886707 LOCAL 2020-05-30 00:00:00.659000+08:00 2020-05-30 00:00:00.662000+08:00 3 True 37bd717261295c467616 09fce17261295e939941 62e7817261295c589816 docker_004 local_method_007 db_009
10886709 LOCAL 2020-05-30 00:00:00.662000+08:00 2020-05-30 00:00:00.664000+08:00 2 True 37bd717261295c467616 6bdec17261295e969954 62e7817261295c589816 docker_004 local_method_006 db_009
10886711 LOCAL 2020-05-30 00:00:00.664000+08:00 2020-05-30 00:00:00.667000+08:00 3 True 37bd717261295c467616 f5a3e17261295e989964 62e7817261295c589816 docker_004 local_method_009 db_009
10886713 LOCAL 2020-05-30 00:00:00.667000+08:00 2020-05-30 00:00:00.674000+08:00 7 True 37bd717261295c467616 79b8c17261295e9b9974 62e7817261295c589816 docker_004 local_method_010 db_007
17687689 OSB 2020-05-30 00:00:00.070000+08:00 2020-05-30 00:00:00.993000+08:00 922 True 37bd717261295c467616 cb48b17261295c467617 None os_021 osb_001 NaN
18033277 RemoteProcess 2020-05-30 00:00:00.088000+08:00 2020-05-30 00:00:00.679000+08:00 591 True 37bd717261295c467616 62e7817261295c589816 8073017261295c527619 docker_004 csf_001 NaN
18033281 RemoteProcess 2020-05-30 00:00:00.155000+08:00 2020-05-30 00:00:00.262000+08:00 107 True 37bd717261295c467616 7bf1417261295c9b4394 fca3117261295c679826 docker_007 csf_002 NaN
18033288 RemoteProcess 2020-05-30 00:00:00.274000+08:00 2020-05-30 00:00:00.438000+08:00 164 True 37bd717261295c467616 5a67717261295d123619 15dcd17261295d0a9830 docker_008 csf_003 NaN
18033289 RemoteProcess 2020-05-30 00:00:00.443000+08:00 2020-05-30 00:00:00.558000+08:00 115 True 37bd717261295c467616 c9d2b17261295dbb4485 d2fdc17261295db99834 docker_007 csf_004 NaN
18033293 RemoteProcess 2020-05-30 00:00:00.564000+08:00 2020-05-30 00:00:00.613000+08:00 49 True 37bd717261295c467616 9ecec17261295e344519 fe0fa17261295e319838 docker_007 csf_005 NaN
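The tip above can be sketched in code. This is a minimal, self-contained example with toy data; in the notebook, `trace_df` would hold the concatenated trace records with the columns shown in the table:

```python
import pandas as pd

# Follow a traceId to collect all related service calls, then reconstruct
# the (possibly deeply nested) call tree via the id -> pid (parent) links.
trace_df = pd.DataFrame({
    "traceId": ["t1", "t1", "t1", "t2"],
    "id":      ["a",  "b",  "c",  "d"],
    "pid":     [None, "a",  "b",  None],
    "cmdb_id": ["os_021", "os_021", "docker_004", "os_022"],
})

# All service calls belonging to one trace.
one_trace = trace_df[trace_df["traceId"] == "t1"]

# Map each span to its children to walk the nested structure
# (groupby drops the None pid of the root span).
children = one_trace.groupby("pid")["id"].apply(list).to_dict()
print(children)  # {'a': ['b'], 'b': ['c']}
```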

Point #2: There are only 10 different cmdb_id¶

  • docker_001 to docker_008
  • os_021 to os_022

Therefore, for this assignment we will, as much as possible, associate our findings with these cmdb_ids

count
docker_002 3065368
docker_003 3061799
docker_004 3060144
docker_001 3058418
docker_005 1703686
docker_007 1701104
docker_008 1700397
docker_006 1699433
os_022 344828
os_021 344788

Point #3: Different cmdb_id have different callType¶

  • docker_001 to docker_004 contain: JDBC, LOCAL, CSF, FlyRemote, RemoteProcess
  • docker_005 to docker_008 contain: JDBC, LOCAL, RemoteProcess
  • os_021 to os_022 contain: OSB, CSF

Note that this merely means that these are the call types (callType) observed under each host

It does not mean that a cmdb_id will not call other cmdb_ids (refer to the example in Point #1, where processes apart from os_021 are called too)
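A count like the one shown can be produced with a groupby; a minimal sketch on toy data (`trace_df` is an assumed variable name):

```python
import pandas as pd

# Toy stand-in for the full trace data; the real frame has millions of rows.
trace_df = pd.DataFrame({
    "cmdb_id":  ["docker_001", "docker_001", "docker_001", "os_021", "os_021"],
    "callType": ["JDBC", "JDBC", "LOCAL", "OSB", "CSF"],
})

# Count rows per (cmdb_id, callType) pair, largest first.
counts = (trace_df.groupby(["cmdb_id", "callType"])
                  .size()
                  .rename("count")
                  .sort_values(ascending=False))
print(counts)
```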

count
cmdb_id callType
docker_001 JDBC 1357451
LOCAL 1188712
CSF 340455
FlyRemote 85900
RemoteProcess 85900
docker_002 JDBC 1361103
LOCAL 1191050
CSF 341129
FlyRemote 86043
RemoteProcess 86043
docker_003 JDBC 1359637
LOCAL 1189566
CSF 340687
RemoteProcess 85964
FlyRemote 85945
docker_004 JDBC 1358624
LOCAL 1189076
CSF 340578
FlyRemote 85933
RemoteProcess 85933
docker_005 JDBC 851049
LOCAL 511518
RemoteProcess 341119
docker_006 JDBC 848925
LOCAL 510043
RemoteProcess 340465
docker_007 JDBC 849753
LOCAL 510719
RemoteProcess 340632
docker_008 JDBC 849405
LOCAL 510360
RemoteProcess 340632
os_021 OSB 172798
CSF 171990
os_022 OSB 172790
CSF 172038

Point #4: Observing the ratio of callType for different processes¶

  • docker_001 to docker_004 ratio:
    1. JDBC = 44%
    2. LOCAL = 38%
    3. CSF = 11%
    4. FlyRemote = 2%
    5. RemoteProcess = 2%
  • docker_005 to docker_008 ratio:
    1. JDBC = 50%
    2. LOCAL = 30%
    3. RemoteProcess = 20%
  • os_021 to os_022 ratio:
    1. OSB = 50%
    2. CSF = 50%

These ratios could suggest a certain mapping amongst the service calls. For example, we suspect that OSB and CSF are closely related in a 1-to-1 mapping
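The ratio table can be computed by normalising the counts within each cmdb_id; a minimal sketch on toy data:

```python
import pandas as pd

# Toy stand-in: os_021 with an even OSB/CSF split, as in the real data.
trace_df = pd.DataFrame({
    "cmdb_id":  ["os_021"] * 4,
    "callType": ["OSB", "OSB", "CSF", "CSF"],
})

# Share of each callType within its host, as a percentage.
ratio = (trace_df.groupby("cmdb_id")["callType"]
                 .value_counts(normalize=True)
                 .mul(100)
                 .round(2))
print(ratio)
```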

ratio
cmdb_id callType
docker_001 JDBC 44.38%
LOCAL 38.87%
CSF 11.13%
FlyRemote 2.81%
RemoteProcess 2.81%
docker_002 JDBC 44.4%
LOCAL 38.86%
CSF 11.13%
FlyRemote 2.81%
RemoteProcess 2.81%
docker_003 JDBC 44.41%
LOCAL 38.85%
CSF 11.13%
RemoteProcess 2.81%
FlyRemote 2.81%
docker_004 JDBC 44.4%
LOCAL 38.86%
CSF 11.13%
FlyRemote 2.81%
RemoteProcess 2.81%
docker_005 JDBC 49.95%
LOCAL 30.02%
RemoteProcess 20.02%
docker_006 JDBC 49.95%
LOCAL 30.01%
RemoteProcess 20.03%
docker_007 JDBC 49.95%
LOCAL 30.02%
RemoteProcess 20.02%
docker_008 JDBC 49.95%
LOCAL 30.01%
RemoteProcess 20.03%
os_021 OSB 50.12%
CSF 49.88%
os_022 OSB 50.11%
CSF 49.89%

Point #5: Visualising the different serviceName¶

Note the different service names available for each callType and each cmdb_id

For docker_001 to docker_004:

  • JDBC -> No serviceName
  • LOCAL -> local_method_001 to local_method_010
  • CSF -> csf_002 to csf_005
  • FlyRemote -> fly_remote_001
  • RemoteProcess -> csf_001

For docker_005 to docker_008:

  • JDBC -> No serviceName
  • LOCAL -> local_method_011, local_method_013 to local_method_017
  • RemoteProcess -> csf_002 to csf_005

For os_021 to os_022:

  • CSF -> csf_001
  • OSB -> osb_001
count
cmdb_id callType serviceName
docker_001 CSF csf_002 85724
csf_003 84913
csf_004 84910
csf_005 84908
FlyRemote fly_remote_001 85900
LOCAL local_method_004 339632
local_method_002 169816
local_method_006 169816
local_method_001 84908
local_method_003 84908
local_method_005 84908
local_method_007 84908
local_method_009 84908
local_method_010 84908
RemoteProcess csf_001 85900
docker_002 CSF csf_002 85893
csf_003 85081
csf_004 85078
csf_005 85077
FlyRemote fly_remote_001 86043
LOCAL local_method_004 340300
local_method_002 170150
local_method_006 170150
local_method_001 85075
local_method_003 85075
local_method_005 85075
local_method_007 85075
local_method_009 85075
local_method_010 85075
RemoteProcess csf_001 86043
docker_003 CSF csf_002 85770
csf_003 84974
csf_004 84972
csf_005 84971
FlyRemote fly_remote_001 85945
LOCAL local_method_004 339876
local_method_002 169938
local_method_006 169938
local_method_001 84969
local_method_003 84969
local_method_005 84969
local_method_007 84969
local_method_009 84969
local_method_010 84969
RemoteProcess csf_001 85964
docker_004 CSF csf_002 85768
csf_003 84939
csf_004 84937
csf_005 84934
FlyRemote fly_remote_001 85933
LOCAL local_method_004 339736
local_method_002 169868
local_method_006 169868
local_method_001 84934
local_method_003 84934
local_method_005 84934
local_method_007 84934
local_method_009 84934
local_method_010 84934
RemoteProcess csf_001 85933
docker_005 LOCAL local_method_011 85891
local_method_015 85369
local_method_016 85368
local_method_013 85031
local_method_014 85031
local_method_017 84828
RemoteProcess csf_002 85891
csf_004 85369
csf_003 85031
csf_005 84828
docker_006 LOCAL local_method_011 85726
local_method_017 85157
local_method_013 84963
local_method_014 84960
local_method_015 84619
local_method_016 84618
RemoteProcess csf_002 85726
csf_005 85157
csf_003 84963
csf_004 84619
docker_007 LOCAL local_method_011 85756
local_method_015 85197
local_method_016 85194
local_method_013 84895
local_method_014 84893
local_method_017 84784
RemoteProcess csf_002 85756
csf_004 85197
csf_003 84895
csf_005 84784
docker_008 LOCAL local_method_011 85780
local_method_017 85121
local_method_013 85018
local_method_014 85017
local_method_015 84712
local_method_016 84712
RemoteProcess csf_002 85781
csf_005 85121
csf_003 85018
csf_004 84712
os_021 CSF csf_001 171990
OSB osb_001 172798
os_022 CSF csf_001 172038
OSB osb_001 172790

Point #6: All traceId start with a root that is either os_021 or os_022¶

os_021    365022
os_022    365019
Name: cmdb_id, dtype: int64
os_021    172798
os_022    172790
Name: cmdb_id, dtype: int64

Report Analysis¶

Goal: To figure out the specific performance that indicates each failure and the corresponding root causes, answering from the following aspects:

  1. How does the ESB data behave abnormally?
  2. How do the root causes affect the related Trace and Host KPIs?

Ultimately, the goal is to understand how each failure can be identified (through which metrics) in the given data. This mimics a real-world machine learning project, where we first have to understand the data and obtain the relevant domain knowledge to perform feature selection.

In this case, Assignment 2 is a lead up to the course project, where we will carefully choose the algorithms to apply based on our findings in Assignment 2.

This report analysis is structured as follows:

  1. How did the failure affect ESB (Observe the ESB indicators)
  2. How did the failure affect host data (Observe the Host indicators)
  3. How did the failure affect trace data (Observe the Trace indicators)

For ESB Indicators

We will plot a comparison graph between test (with failure) and train (no failure) data across a period of 1 hour (since ESB indicators are recorded at 1-minute intervals) to observe the impact of the failure on the business-side metrics. These are the metrics that directly impact, and are of primary concern to, end users (business POV).
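As a sketch of this train-vs-test comparison (the notebook uses line charts; here we simply align the two series minute by minute and flag where the test values deviate sharply, all on hypothetical data):

```python
import pandas as pd

# Hypothetical 1-minute ESB frames; column names mirror the ESB tables.
rng = pd.date_range("2020-05-30 00:07", periods=60, freq="min")
esb_train = pd.DataFrame({"startTime": rng, "avg_time": 0.7, "num": 520})
esb_test = esb_train.copy()
esb_test.loc[30:36, ["avg_time", "num"]] = [2.5, 260]  # injected anomaly

# Align train and test by minute, then flag abnormal minutes.
merged = esb_train.merge(esb_test, on="startTime",
                         suffixes=("_train", "_test"))
anomalies = merged[merged["avg_time_test"] > 2 * merged["avg_time_train"]]
print(anomalies[["startTime", "avg_time_train", "avg_time_test"]])
```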

For Host Indicators

We will plot a comparison graph between test (with failure) and train (no failure) data, and across specified periods (since host indicators are recorded in different intervals) to observe the impact of the failure on the host machine metrics. These provide the high-level view of metrics indicating failure and are of primary concern for infrastructure engineers.

For Trace Indicators

We will plot a comparison graph between test (with failure) and train (no failure) data to observe the impact of the failure on the service request metrics. These may serve to both explain the cause for the failure (developer POV) and for us to identify similar failures in the future (site reliability engineer POV), depending on the perspective.

However, this is more complicated, and the general steps are summarised as follows:

  1. Combine all trace files into 1 dataframe
  2. Filter based on time (dynamic approach - refer to code for explanation)
  3. Filter for the parent process (to find traceId and track the parent process statistics)
  4. Compute the trace length for the traceId found above
  5. Visualise Step 4
  6. Depending on the scenario, we will further visualise the elapsedTime of the parent process (e.g. further split by cmdb_id)
  7. Depending on the scenario, we will analyse the root cause (e.g. Failure 1 - why did the parent process take more time)
  8. Based on 5, 6 and 7, our analysis will differ for each failure to explain their cause and how to identify similar failures
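Steps 1-4 above can be condensed into a sketch (toy data; the real notebook concatenates many trace CSVs and parameterises the failure window):

```python
import pandas as pd

# Hypothetical failure window around failure #1 (00:37).
failure_ts = pd.Timestamp("2020-05-30 00:37:00")
window = pd.Timedelta("10min")

trace_df = pd.DataFrame({
    "traceId":   ["t1", "t1", "t2"],
    "pid":       [None, "a",  None],
    "startTime": pd.to_datetime(["2020-05-30 00:36", "2020-05-30 00:36",
                                 "2020-05-30 01:30"]),
})

# Step 2: keep records near the failure time.
near = trace_df[(trace_df["startTime"] >= failure_ts - window) &
                (trace_df["startTime"] <= failure_ts + window)]
# Step 3: parent (root) spans have no pid.
parents = near[near["pid"].isna()]
# Step 4: trace length = number of spans sharing a traceId.
trace_len = (trace_df[trace_df["traceId"].isin(parents["traceId"])]
             .groupby("traceId").size())
print(trace_len)
```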

Breaking Down Failure #1¶

Failure #1 Details:

[
    2220, 
    [[
        "os_021", 
        "Sent_queue"
    ]]
]

ESB Indicators¶

Analysis: From the 4 line charts, we observe that although the success rate did not change (subplot #4 remained constant at 1), the average time taken to process each service call increased sharply and the number of service requests handled halved around the failure period. This unusual behaviour can be noted by comparing the test data against itself across 1 hour, as well as against the train data (normal data with no failures) over the same time period. This indicates that failure #1 does not cause service calls to fail, but can lead to side effects such as longer call times and fewer service requests handled. This phenomenon could potentially be due to the app's error handling, or simply due to poorer load balancing.

Given that os_021 is the root cause, and since os_021 is a virtual machine hosting the app services, we suspect that the failure of os_021 meant that fewer machines were processing the service requests. This aligns with the num metric, which decreased to half of its original level (because 1 out of the 2 OS hosts is down). With a higher load spread across fewer available machines, each service request takes longer to process, and fewer service requests are processed in the same amount of time.

Host Indicators¶

Analysis: Given that Sent_queue is a host KPI to observe for failure #1, we plot the following graph - comparing Sent_queue values across 70 minutes (corresponding to 14 data points, i.e. one every 5 minutes) and against the train data for the same time period.

We observe that the Sent_queue value peaked right after the failure time (00:37:00), before returning to normal after some time. Sent_queue represents the number of service requests queued for processing. Combining this observation with our earlier speculation (in ESB Indicators), we can confirm that the failure involves a virtual machine that hosts the app services in the distributed system. Therefore, the failure of os_021 affected load balancing, causing service requests to take longer to process and thus fewer service requests to be handled in the same time period.

However, from the graph, we notice that the accumulated requests dropped back to 0 soon after. This could mean that either the failed machine os_021 was fixed soon after, or the remaining serviceable machines managed to clear the queue. It is possible that the load balancing algorithm adjusted itself, or that the remaining machines scaled up vertically (to handle more requests per machine) or horizontally (by spinning up another temporary machine to help). The exact reason is not part of our analysis in Assignment 2.

In conclusion, we note that a possible failure could involve the virtual machine hosting the services. Such failures, although they would not affect the service's success rate, would affect its latency and lead to slower execution. To identify these problems in production, we can watch for any drastic or abnormal increase in the operating system's Sent_queue metric.

Trace Indicators¶

Analysis: Since we know that os_021 has a failure, we will plot the following graphs - (1) comparing the length of trace data, (2) comparing the elapsed time of the parent processes for os_021, (3) comparing the elapsed time of the parent processes for os_022

These graphs' time periods are carefully chosen to ensure that long-elapsed-time calls are shown and there is ample data before and after to ascertain the trend.

From the graphs, we can tell that:

  1. The number of trace invocations (trace length) remained roughly the same despite the failure
  2. os_021 experienced a distinct increase in elapsed time compared to train_data and os_022 (for the same time period)
  3. os_022 did not experience any change in elapsed time

Finding 1: osb_001 is the parent of csf_001 (thus, osb_001's elapsed time is always larger than csf_001's). However, given that osb is the inner span while csf is the outer span, osb - csf gives us the actual time spent in osb itself. This prompts us to write code to generate the actual time for each trace.

The logic of the code lies in the fact that every trace starts with an inner span, followed by one or more outer-span children. Therefore, the actual time should be interpreted as:

  1. For inner spans: time taken to run the code inside the services
  2. For outer spans: time taken for the code to "perform a context switch" to the service (or perhaps push data to the stack)
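The actual-time logic above can be sketched as follows: a span's actual (exclusive) time is its elapsedTime minus the elapsedTime of its direct children (toy data; column names follow the trace tables shown earlier):

```python
import pandas as pd

# Toy three-level trace: osb -> csf -> leaf.
spans = pd.DataFrame({
    "id":          ["osb", "csf", "leaf"],
    "pid":         [None,  "osb", "csf"],
    "elapsedTime": [922,   886,   60],
})

# Total elapsed time of each span's direct children.
child_time = spans.groupby("pid")["elapsedTime"].sum()

# Actual (exclusive) time = own elapsed time - children's elapsed time.
spans["actual_time"] = (spans["elapsedTime"]
                        - spans["id"].map(child_time).fillna(0))
print(spans[["id", "actual_time"]])
```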

Subsequently, we sampled 100 traceIDs (a reasonably large sample for approximation) from both the train and test data to:

  1. Calculate actual time taken for each process ID, combining them through a unique identifier (in the format of cmdb_id:serviceName:dsName)
  2. Find the average time taken across all 100 traceIDs for each unique identifier
  3. Merge the train and test dataframe to compute their differences (where possible)
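Steps 1-3 above can be sketched as follows (toy data; in the notebook the actual times come from the computation described in Finding 1):

```python
import pandas as pd

# Hypothetical per-span actual times from the sampled traceIDs,
# keyed by cmdb_id:serviceName:dsName.
train = pd.DataFrame({"unique_identifier": ["os_021:osb_001:nan"] * 2,
                      "actual_time": [14.0, 15.6]})
test = pd.DataFrame({"unique_identifier": ["os_021:osb_001:nan"] * 2,
                     "actual_time": [2900.0, 2983.5]})

# Step 2: average per unique identifier.
train_avg = (train.groupby("unique_identifier")["actual_time"]
                  .mean().rename("train_actual_time"))
test_avg = (test.groupby("unique_identifier")["actual_time"]
                .mean().rename("test_actual_time"))

# Step 3: merge and compute the difference (train - test, so slower
# test spans show up as large negative values, as in the table below).
compare = pd.concat([train_avg, test_avg], axis=1).reset_index()
compare["difference"] = (compare["train_actual_time"]
                         - compare["test_actual_time"])
print(compare)
```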

From the output of the above step, we notice that:

  1. os_021:osb_001:nan has the largest difference of 2953.70 (values vary with the sample)
  2. os_021:csf_001:nan has the 2nd largest difference of 1189.69 (values vary with the sample)

Therefore, we can conclude that the failure of os_021 can be attributed to the large increase in time taken to execute the osb_001 code. Additionally, since csf_001 is the outer span and child of osb_001, we can conclude that there is a large increase in time to pass control to other microservices called by osb_001.

What this means to the developer: A code review should be performed on osb_001 to understand the cause of the increase in time taken to execute the code and call other microservices. Relating this back to the Host Indicators, we suspect that the code cannot handle a surge in requests (i.e. Sent_queue increasing sharply) well.

What this means to the site reliability engineer / ML engineer: We can potentially detect virtual machine failures by observing the osb_001 and csf_001 metrics for large increases in execution time.

Finding the root cause for the slower processing speed

unique_identifier train_actual_time test_actual_time difference
46 os_021:osb_001:nan 14.80 2941.75 -2926.95
45 os_021:csf_001:nan 5.80 1187.10 -1181.30
17 docker_004:csf_001:nan 16.54 67.60 -51.06
0 docker_003:csf_001:nan 20.24 60.82 -40.58
16 docker_003:nan:db_009 7.98 27.30 -19.32
33 docker_004:nan:db_009 8.12 18.12 -10.00
9 docker_003:local_method_004:db_009 3.36 11.51 -8.15
4 docker_003:csf_005:nan 3.91 10.53 -6.62
2 docker_003:csf_003:nan 2.36 8.74 -6.38
19 docker_004:csf_003:nan 3.27 9.49 -6.22
20 docker_004:csf_004:nan 2.12 7.55 -5.43
1 docker_003:csf_002:nan 5.83 9.80 -3.97
5 docker_003:fly_remote_001:nan 3.16 7.03 -3.87
23 docker_004:local_method_001:db_009 0.42 4.16 -3.74
31 docker_004:local_method_010:db_007 1.55 5.24 -3.69
22 docker_004:fly_remote_001:nan 2.39 6.04 -3.65
30 docker_004:local_method_009:db_009 1.26 4.90 -3.64
3 docker_003:csf_004:nan 3.38 6.97 -3.59
18 docker_004:csf_002:nan 4.61 8.12 -3.51
26 docker_004:local_method_004:db_009 2.18 5.57 -3.39
8 docker_003:local_method_003:db_009 0.99 4.20 -3.21
32 docker_004:nan:db_007 3.25 6.14 -2.89
15 docker_003:nan:db_007 2.73 5.40 -2.67
21 docker_004:csf_005:nan 3.52 6.00 -2.48
12 docker_003:local_method_007:db_009 1.11 3.47 -2.36
27 docker_004:local_method_005:db_009 0.51 2.34 -1.83
28 docker_004:local_method_006:db_009 1.92 3.65 -1.73
6 docker_003:local_method_001:db_009 0.43 2.11 -1.68
29 docker_004:local_method_007:db_009 0.59 2.26 -1.67
24 docker_004:local_method_002:db_009 1.98 3.39 -1.41
10 docker_003:local_method_005:db_009 1.15 2.49 -1.34
14 docker_003:local_method_010:db_007 2.31 3.65 -1.34
7 docker_003:local_method_002:db_009 3.47 4.79 -1.32
13 docker_003:local_method_009:db_009 2.17 3.32 -1.15
11 docker_003:local_method_006:db_009 2.79 3.86 -1.07
25 docker_004:local_method_003:db_009 0.48 0.57 -0.09
34 docker_006:csf_002:nan 25.97 NaN NaN
35 docker_006:csf_003:nan 14.11 NaN NaN
36 docker_006:csf_004:nan 13.10 NaN NaN
37 docker_006:csf_005:nan 11.68 NaN NaN
38 docker_006:local_method_011:db_003 2.28 NaN NaN
39 docker_006:local_method_013:db_003 2.97 NaN NaN
40 docker_006:local_method_014:db_003 1.63 NaN NaN
41 docker_006:local_method_015:db_003 2.41 NaN NaN
42 docker_006:local_method_016:db_003 1.73 NaN NaN
43 docker_006:local_method_017:db_003 2.18 NaN NaN
44 docker_006:nan:db_003 25.03 NaN NaN
47 docker_007:csf_002:nan NaN 13.22 NaN
48 docker_007:csf_003:nan NaN 7.33 NaN
49 docker_007:csf_004:nan NaN 6.87 NaN
50 docker_007:csf_005:nan NaN 5.07 NaN
51 docker_007:local_method_011:db_003 NaN 1.15 NaN
52 docker_007:local_method_013:db_003 NaN 1.31 NaN
53 docker_007:local_method_014:db_003 NaN 0.79 NaN
54 docker_007:local_method_015:db_003 NaN 1.44 NaN
55 docker_007:local_method_016:db_003 NaN 0.79 NaN
56 docker_007:local_method_017:db_003 NaN 0.92 NaN
57 docker_007:nan:db_003 NaN 12.63 NaN
58 docker_008:csf_002:nan NaN 13.58 NaN
59 docker_008:csf_003:nan NaN 6.22 NaN
60 docker_008:csf_004:nan NaN 5.72 NaN
61 docker_008:csf_005:nan NaN 5.96 NaN
62 docker_008:local_method_011:db_003 NaN 1.05 NaN
63 docker_008:local_method_013:db_003 NaN 0.99 NaN
64 docker_008:local_method_014:db_003 NaN 1.02 NaN
65 docker_008:local_method_015:db_003 NaN 1.06 NaN
66 docker_008:local_method_016:db_003 NaN 0.67 NaN
67 docker_008:local_method_017:db_003 NaN 0.98 NaN
68 docker_008:nan:db_003 NaN 13.13 NaN

Breaking Down Failure #2¶

Failure #2 Details:

[
    6960, 
    [[
        "docker_007", 
        null
    ]]
]

ESB Indicators¶

Analysis: From the 4 line charts, we observe that soon after the failure time (01:56:00), the success rate dipped slightly before dropping drastically. This trend is confirmed by the num and succee_num charts, which show the values dipping slightly (from 01:58:00 to 01:59:00) before experiencing a sharp decline (from 02:02:00 to 02:03:00). Finally, the average time taken to process the service requests also increased slightly at first, followed by a sharp increase later.

From these indicators alone, we cannot really deduce what the problem is. However, this could be indicative of a failure that propagated to other microservices and caused them to malfunction. Under this assumption, the small dip in success rate could be due to the initial failure, which was soon fixed. However, the failure could have propagated to other microservices, leading to a domino effect and causing a large number of service requests to fail.

serviceName startTime avg_time num succee_num succee_rate
118 osb_001 01:58:00 2.1664 371 370 0.9973
119 osb_001 01:59:00 1.1802 512 493 0.9629
122 osb_001 02:02:00 0.7779 568 23 0.0405
123 osb_001 02:03:00 5.1987 224 118 0.5268
124 osb_001 02:04:00 1.6619 356 355 0.9972

Host Indicators¶

FYI: From cmdb.xlsx, docker_007 is also mapped to os_019. However, we will only check the host data in dcos_docker.csv

Analysis: Trying out all possible KPI names for docker_007, we note the following observations:

  1. container_cpu_used -> Although there are fluctuations compared to the train data, they do not appear anomalous when compared with other dockers (e.g. docker_006) in the test data
  2. container_mem_used -> Although there are fluctuations compared to the train data, they do not appear anomalous when compared with other dockers (e.g. docker_006) in the test data
  3. container_fgc -> All values are 0
  4. container_fgct -> All values are 0
  5. container_session_used -> All values are 0
  6. container_thread_idle -> All values are 0
  7. container_thread_running -> All values are 0
  8. container_thread_total -> All values are 0
  9. container_thread_used_pct -> All values are 0

The major dips in the values were mostly observed around 02:06:00. Comparing the graphs against the ESB indicators, we observe that these large decreases in values happen slightly after the drastic drop in success rate. Therefore, we suspect that they could be inter-related.

itemid name bomc_id timestamp value cmdb_id
8108 999999996381307 container_mem_used ZJ-004-060 02:06:04 42.0 docker_007
8135 999999996381307 container_mem_used ZJ-004-060 02:06:28 50.0 docker_007
8216 999999996381307 container_mem_used ZJ-004-060 02:07:41 55.0 docker_007
8279 999999996381307 container_mem_used ZJ-004-060 02:08:37 55.0 docker_007
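Extracting one KPI series per container for this kind of side-by-side comparison can be sketched as follows, assuming the host table columns shown above (cmdb_id, name, timestamp, value):

```python
import pandas as pd

def kpi_series(host: pd.DataFrame, cmdb_id: str, kpi: str) -> pd.Series:
    # One KPI time series for one component, indexed and ordered by timestamp
    rows = host[(host["cmdb_id"] == cmdb_id) & (host["name"] == kpi)]
    return rows.set_index("timestamp")["value"].sort_index()

# Illustrative usage: compare docker_007 against docker_006 for one KPI
host = pd.DataFrame({
    "cmdb_id": ["docker_007", "docker_007", "docker_006"],
    "name": ["container_mem_used"] * 3,
    "timestamp": ["02:06:04", "02:06:28", "02:06:10"],
    "value": [42.0, 50.0, 48.0],
})
s7 = kpi_series(host, "docker_007", "container_mem_used")
```

The two series can then be plotted on the same axes (e.g. with `s7.plot()`) to judge whether docker_007's fluctuations are anomalous relative to its peers.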

Trace Indicators¶

Analysis: Since we know that docker_007 has a failure and that every traceId begins from a root process on either os_021 or os_022 (refer to Data Exploration's Finding #6), we will plot the following graphs: (1) comparing the length of trace data, (2) comparing the elapsed time of the parent processes for os_021, and (3) comparing the elapsed time of the parent processes for os_022

The time periods for these graphs are carefully chosen to ensure that long-elapsed-time calls are shown and that there is ample data before and after to ascertain the trend.

From the graphs, we can tell that:

  1. Number of trace invocations (trace length) remained roughly the same, except from 02:02:00 to 02:03:00 (time may change due to sampling)
  2. os_021 experienced (1) a distinct fluctuation in elapsed time (even going negative) as compared to train_data and os_022 (for the same time period), (2) higher elapsed time around 01:59:00 and 02:03:00
  3. os_022 experienced (1) lower call frequency around 01:59:00 and 02:02:00, (2) higher elapsed time around 02:00:00 and 02:03:00

Finding 1: Most traceIds contain traces from different dockers e.g. `docker_003`, `docker_004`, `docker_006`, etc

However, docker_007 is more similar to docker_005 to docker_008 than to docker_001 to docker_004. This will shape the logic subsequently, when we compare the docker metrics against one another to detect anomalies.

Subsequently, we sampled 100 traceIds (a sufficiently large sample for approximation) from both the train and test data to:

  1. Calculate the actual time taken for each process ID, combining them through a unique identifier (in the format cmdb_id:serviceName:dsName)
  2. Find the average time taken across all 100 traceIds for each unique identifier
  3. Merge the train and test dataframes to compute their differences (where possible)
  4. Filter for docker_006, docker_007, docker_008 in the unique_identifier (since they share similar metrics and trace calls)
  5. Plot and compare their values to observe for anomalies
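The first three steps above can be sketched in pandas. This is a minimal sketch, not the exact implementation: "actual" (self) time is taken as a span's elapsedTime minus the elapsedTime of its direct children, and the 100-traceId sampling step is omitted for brevity. The trace columns (traceId, id, pid, cmdb_id, serviceName, dsName, elapsedTime) follow the Data Exploration table:

```python
import pandas as pd

def actual_times(trace: pd.DataFrame) -> pd.DataFrame:
    # Self time per span = elapsedTime minus the sum of the direct children's elapsedTime
    child_time = trace.groupby("pid")["elapsedTime"].sum().rename("child_time")
    df = trace.merge(child_time, left_on="id", right_index=True, how="left")
    df["child_time"] = df["child_time"].fillna(0)
    df["actual_time"] = df["elapsedTime"] - df["child_time"]
    # Unique identifier in the format cmdb_id:serviceName:dsName
    df["unique_identifier"] = (
        df["cmdb_id"].astype(str) + ":"
        + df["serviceName"].astype(str) + ":"
        + df["dsName"].astype(str)
    )
    return df

def compare_actual_times(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
    # Average actual time per unique identifier, merged across train/test
    t1 = actual_times(train).groupby("unique_identifier")["actual_time"].mean()
    t2 = actual_times(test).groupby("unique_identifier")["actual_time"].mean()
    out = pd.concat(
        [t1.rename("train_actual_time"), t2.rename("test_actual_time")], axis=1
    ).reset_index()
    out["difference"] = out["train_actual_time"] - out["test_actual_time"]
    return out.sort_values("difference")
```

Filtering `unique_identifier` for the dockers of interest and plotting `difference` then reproduces the comparison described in steps 4 and 5.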

From the output graph of the above step, we notice that:

  1. csf_002:None has the largest difference (may vary due to sampling)

Therefore, since csf_002 is an outer span, we can conclude that the failure of docker_007 can be attributed to the large increase in time taken to pass control to the other microservices called by csf_002.

What this means to the developer: Code review should be performed on docker_007:csf_002:None to understand the cause for the increase in time taken to call other microservices.

What this means to site reliability engineer / ML engineer: We can potentially detect docker failure (for docker_005 to docker_008) by observing the csf_002:None metrics, for large increase in actual time. Alternatively, we can try using trace invocation length as an additional indicator.

Finding the root cause for the slower processing speed

unique_identifier train_actual_time test_actual_time difference
0 docker_003:csf_001:nan 24.98 2052.32 -2027.34
17 docker_004:csf_001:nan 15.57 1390.54 -1374.97
45 os_021:csf_001:nan 8.25 664.86 -656.61
26 docker_004:local_method_004:db_009 4.38 238.56 -234.18
4 docker_003:csf_005:nan 2.87 233.19 -230.32
1 docker_003:csf_002:nan 3.58 203.80 -200.22
9 docker_003:local_method_004:db_009 5.66 179.34 -173.68
3 docker_003:csf_004:nan 3.32 149.33 -146.01
24 docker_004:local_method_002:db_009 2.10 146.53 -144.43
18 docker_004:csf_002:nan 4.68 134.23 -129.55
16 docker_003:nan:db_009 7.11 127.81 -120.70
7 docker_003:local_method_002:db_009 3.66 106.97 -103.31
33 docker_004:nan:db_009 7.54 106.69 -99.15
19 docker_004:csf_003:nan 4.25 92.63 -88.38
28 docker_004:local_method_006:db_009 2.21 90.27 -88.06
6 docker_003:local_method_001:db_009 1.78 84.31 -82.53
20 docker_004:csf_004:nan 4.10 84.77 -80.67
11 docker_003:local_method_006:db_009 2.24 75.86 -73.62
31 docker_004:local_method_010:db_007 1.42 71.01 -69.59
2 docker_003:csf_003:nan 3.89 70.40 -66.51
21 docker_004:csf_005:nan 1.89 66.99 -65.10
14 docker_003:local_method_010:db_007 1.87 62.79 -60.92
10 docker_003:local_method_005:db_009 0.77 47.80 -47.03
30 docker_004:local_method_009:db_009 2.91 48.47 -45.56
29 docker_004:local_method_007:db_009 1.48 42.46 -40.98
27 docker_004:local_method_005:db_009 1.18 42.13 -40.95
8 docker_003:local_method_003:db_009 0.75 37.83 -37.08
12 docker_003:local_method_007:db_009 0.69 36.74 -36.05
32 docker_004:nan:db_007 3.09 38.00 -34.91
13 docker_003:local_method_009:db_009 1.38 33.58 -32.20
25 docker_004:local_method_003:db_009 0.77 29.06 -28.29
22 docker_004:fly_remote_001:nan 3.44 30.26 -26.82
23 docker_004:local_method_001:db_009 0.45 23.39 -22.94
5 docker_003:fly_remote_001:nan 3.78 25.70 -21.92
15 docker_003:nan:db_007 4.55 26.21 -21.66
46 os_021:osb_001:nan 18.60 34.32 -15.72
34 docker_006:csf_002:nan 26.38 NaN NaN
35 docker_006:csf_003:nan 14.16 NaN NaN
36 docker_006:csf_004:nan 12.94 NaN NaN
37 docker_006:csf_005:nan 12.24 NaN NaN
38 docker_006:local_method_011:db_003 2.15 NaN NaN
39 docker_006:local_method_013:db_003 2.61 NaN NaN
40 docker_006:local_method_014:db_003 2.13 NaN NaN
41 docker_006:local_method_015:db_003 2.95 NaN NaN
42 docker_006:local_method_016:db_003 1.80 NaN NaN
43 docker_006:local_method_017:db_003 2.24 NaN NaN
44 docker_006:nan:db_003 25.62 NaN NaN
47 docker_007:csf_002:nan NaN 89.90 NaN
48 docker_007:csf_003:nan NaN 31.01 NaN
49 docker_007:csf_004:nan NaN 14.09 NaN
50 docker_007:csf_005:nan NaN 8.20 NaN
51 docker_007:local_method_011:db_003 NaN 3.34 NaN
52 docker_007:local_method_013:db_003 NaN 2.98 NaN
53 docker_007:local_method_014:db_003 NaN 2.61 NaN
54 docker_007:local_method_015:db_003 NaN 3.02 NaN
55 docker_007:local_method_016:db_003 NaN 2.09 NaN
56 docker_007:local_method_017:db_003 NaN 1.75 NaN
57 docker_007:nan:db_003 NaN 15.77 NaN
58 docker_008:csf_002:nan NaN 49.84 NaN
59 docker_008:csf_003:nan NaN 11.46 NaN
60 docker_008:csf_004:nan NaN 17.50 NaN
61 docker_008:csf_005:nan NaN 10.61 NaN
62 docker_008:local_method_011:db_003 NaN 3.53 NaN
63 docker_008:local_method_013:db_003 NaN 3.55 NaN
64 docker_008:local_method_014:db_003 NaN 1.07 NaN
65 docker_008:local_method_015:db_003 NaN 2.69 NaN
66 docker_008:local_method_016:db_003 NaN 1.10 NaN
67 docker_008:local_method_017:db_003 NaN 1.79 NaN
68 docker_008:nan:db_003 NaN 14.62 NaN

Breaking Down Failure #3¶

Failure #3 Details:

[
    12480,
    [
      ["os_017", "Recv_total"],
      ["os_017", "Send_total"]
    ]
]

ESB Indicators¶

Analysis: From the 4 line charts, we observe that although the success rate did not change (subplot #4 remained constant at 1), the average time taken to process each service call increased sharply and the number of service requests handled decreased around the failure time period. This unusual behaviour can be noted by comparing against itself across 1 hour, as well as comparing against the train data (normal data with no failures) across the same time period. This indicates that failure #3 does not result in failed service calls, but could lead to other side effects such as longer call times and fewer service requests handled. This phenomenon could potentially be due to the app's error handling or simply due to poorer load balancing.

Given that os_017 is the cause and that os_017 is a virtual machine hosting the app services (2 dockers corresponding to each app), we suspect that the failure of os_017 meant there were fewer machines processing the service requests. Interestingly, num decreased to below 20% of its original level despite os_017 being only 1 of the 4 OSes that host the dockers for each app. Therefore, we will need to investigate further to understand this phenomenon.

Host Indicators¶

Analysis: Given that Recv_total and Send_total are host KPIs to observe for failure #3, we plot the following graph - comparing Recv_total and Send_total values across 60 mins (corresponding to 60 data points) and against the train data for the same time period.

We observe that both KPI values experienced a drastic decrease after the failure time (03:28:00), before returning to normal around 03:35:00. Recv_total and Send_total represent the number of service requests received for processing and sent after processing, respectively.

In conclusion, we note that a possible failure could involve the virtual machine hosting the services. Such failures, although they would not affect the service's success rate, would affect its latency and lead to slower execution. To identify these problems in production, we can observe for any drastic / abnormal change (here, a sharp decrease) in the operating system's Recv_total and Send_total metrics.
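One simple way to operationalise this check is to flag test-period KPI points that deviate strongly from the train-period baseline. A minimal sketch; the 3-sigma rule is an illustrative choice, not part of the original analysis:

```python
import pandas as pd

def kpi_deviation(test_values: pd.Series, train_values: pd.Series, k: float = 3.0) -> pd.Series:
    # Flag test KPI points deviating more than k train standard deviations
    # from the train mean; catches both sharp drops and sharp spikes.
    mu, sigma = train_values.mean(), train_values.std()
    return (test_values - mu).abs() > k * sigma
```

Applied to Recv_total or Send_total, the minutes around 03:28:00 would be flagged while the steady minutes before and after would not.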

Trace Indicators¶

Analysis: Since we know that os_017 (which hosts docker_001 and docker_005) has a failure and that every traceId begins from a root process on either os_021 or os_022 (refer to Data Exploration's Finding #6), we will plot the following graphs: (1) comparing the length of trace data, (2) comparing the elapsed time of the parent processes for os_021, and (3) comparing the elapsed time of the parent processes for os_022

The time periods for these graphs are carefully chosen to ensure that long-elapsed-time calls are shown and that there is ample data before and after to ascertain the trend.

From the graphs, we can tell that:

  1. Number of trace invocations (trace length) remained roughly the same despite failure
  2. os_021 did not experience much change in elapsed time (though it has lower call frequency)
  3. os_022 experienced higher elapsed time persisting for about 5 mins (from 03:28:00 to 03:33:00)

Similar to Failure #2's Trace Indicators Explanation, we will compare specific dockers

docker_001 is more similar to docker_002 to docker_004, while docker_005 is more similar to docker_006 to docker_008. This will shape the logic subsequently, when we compare the docker metrics against one another to detect anomalies.

Subsequently, we sampled 100 traceIds (a sufficiently large sample for approximation) from both the train and test data to:

  1. Calculate the actual time taken for each process ID, combining them through a unique identifier (in the format cmdb_id:serviceName:dsName)
  2. Find the average time taken across all 100 traceIds for each unique identifier
  3. Merge the train and test dataframes to compute their differences (where possible)
  4. Filter for docker_001, docker_002 in the unique_identifier (since they share similar metrics and trace calls)
  5. Plot and compare their values to observe for anomalies using a bar chart
  6. Filter for docker_005, docker_006, docker_007, docker_008 in the unique_identifier (since they share similar metrics and trace calls)
  7. Plot and compare their values to observe for anomalies using a line graph

From the output graph of the above step, we notice that:

  1. For docker_001, None:db_009 and csf_001:None have the largest differences
  2. For docker_005, None:db_003 and csf methods (i.e. from csf_002:None to csf_005:None) have the largest differences

Therefore, since all the metrics identified above are outer spans, we can conclude that the failure of os_017 can be attributed to the large increase in time taken to pass control to other microservices.

What this means to the developer: Code review should be performed on the above metrics to understand the cause for the increase in time taken to call other microservices.

What this means to the site reliability engineer / ML engineer: We can potentially detect the failure of an OS that hosts docker containers (i.e. os_017 to os_020) by observing the above identified metrics for large increases in actual time compared to the train data.

Finding the root cause for the slower processing speed

unique_identifier train_actual_time test_actual_time difference
16 docker_001:nan:db_009 11.77 3170.99 -3159.22
0 docker_001:csf_001:nan 37.70 2814.41 -2776.71
56 os_022:csf_001:nan 7.91 595.01 -587.10
20 docker_002:csf_004:nan 8.19 576.10 -567.91
18 docker_002:csf_002:nan 10.74 537.44 -526.70
19 docker_002:csf_003:nan 7.01 470.91 -463.90
15 docker_001:nan:db_007 4.84 442.89 -438.05
21 docker_002:csf_005:nan 3.08 348.05 -344.97
2 docker_001:csf_003:nan 4.89 271.58 -266.69
4 docker_001:csf_005:nan 8.16 248.69 -240.53
5 docker_001:fly_remote_001:nan 2.64 233.06 -230.42
3 docker_001:csf_004:nan 8.69 227.82 -219.13
1 docker_001:csf_002:nan 6.84 224.58 -217.74
6 docker_001:local_method_001:db_009 0.58 92.43 -91.85
57 os_022:osb_001:nan 16.12 90.10 -73.98
17 docker_002:csf_001:nan 23.69 82.70 -59.01
11 docker_001:local_method_006:db_009 2.14 4.11 -1.97
29 docker_002:local_method_007:db_009 0.55 1.41 -0.86
22 docker_002:fly_remote_001:nan 4.25 4.52 -0.27
10 docker_001:local_method_005:db_009 2.13 2.05 0.08
24 docker_002:local_method_002:db_009 2.39 2.30 0.09
8 docker_001:local_method_003:db_009 1.55 1.40 0.15
25 docker_002:local_method_003:db_009 0.66 0.49 0.17
27 docker_002:local_method_005:db_009 2.00 1.27 0.73
23 docker_002:local_method_001:db_009 2.09 1.17 0.92
13 docker_001:local_method_009:db_009 1.58 0.62 0.96
12 docker_001:local_method_007:db_009 2.74 0.64 2.10
30 docker_002:local_method_009:db_009 2.60 0.46 2.14
26 docker_002:local_method_004:db_009 5.87 3.72 2.15
31 docker_002:local_method_010:db_007 3.30 0.87 2.43
7 docker_001:local_method_002:db_009 4.65 1.79 2.86
28 docker_002:local_method_006:db_009 4.98 1.67 3.31
33 docker_002:nan:db_009 9.95 6.46 3.49
9 docker_001:local_method_004:db_009 6.45 2.75 3.70
14 docker_001:local_method_010:db_007 5.19 0.97 4.22
32 docker_002:nan:db_007 9.49 4.58 4.91
34 docker_007:csf_002:nan 14.12 NaN NaN
35 docker_007:csf_003:nan 7.29 NaN NaN
36 docker_007:csf_004:nan 7.19 NaN NaN
37 docker_007:csf_005:nan 6.70 NaN NaN
38 docker_007:local_method_011:db_003 1.10 NaN NaN
39 docker_007:local_method_013:db_003 1.33 NaN NaN
40 docker_007:local_method_014:db_003 0.92 NaN NaN
41 docker_007:local_method_015:db_003 1.19 NaN NaN
42 docker_007:local_method_016:db_003 0.96 NaN NaN
43 docker_007:local_method_017:db_003 1.63 NaN NaN
44 docker_007:nan:db_003 14.52 NaN NaN
45 docker_008:csf_002:nan 13.42 NaN NaN
46 docker_008:csf_003:nan 7.79 NaN NaN
47 docker_008:csf_004:nan 8.13 NaN NaN
48 docker_008:csf_005:nan 6.61 NaN NaN
49 docker_008:local_method_011:db_003 1.09 NaN NaN
50 docker_008:local_method_013:db_003 1.49 NaN NaN
51 docker_008:local_method_014:db_003 0.90 NaN NaN
52 docker_008:local_method_015:db_003 1.12 NaN NaN
53 docker_008:local_method_016:db_003 0.74 NaN NaN
54 docker_008:local_method_017:db_003 0.89 NaN NaN
55 docker_008:nan:db_003 13.59 NaN NaN
58 docker_005:csf_002:nan NaN 1276.10 NaN
59 docker_005:csf_003:nan NaN 547.34 NaN
60 docker_005:csf_004:nan NaN 639.52 NaN
61 docker_005:csf_005:nan NaN 495.06 NaN
62 docker_005:local_method_011:db_003 NaN 1.37 NaN
63 docker_005:local_method_013:db_003 NaN 1.51 NaN
64 docker_005:local_method_014:db_003 NaN 0.89 NaN
65 docker_005:local_method_015:db_003 NaN 1.64 NaN
66 docker_005:local_method_016:db_003 NaN 1.08 NaN
67 docker_005:local_method_017:db_003 NaN 1.18 NaN
68 docker_005:nan:db_003 NaN 2890.51 NaN
69 docker_006:csf_002:nan NaN 9.19 NaN
70 docker_006:csf_003:nan NaN 5.05 NaN
71 docker_006:csf_004:nan NaN 3.64 NaN
72 docker_006:csf_005:nan NaN 4.70 NaN
73 docker_006:local_method_011:db_003 NaN 0.64 NaN
74 docker_006:local_method_013:db_003 NaN 0.71 NaN
75 docker_006:local_method_014:db_003 NaN 0.51 NaN
76 docker_006:local_method_015:db_003 NaN 0.58 NaN
77 docker_006:local_method_016:db_003 NaN 0.41 NaN
78 docker_006:local_method_017:db_003 NaN 0.77 NaN
79 docker_006:nan:db_003 NaN 10.50 NaN

Breaking Down Failure #4¶

Failure #4 Details:

[
    16080,
    [
      ["db_003", "On_Off_State"],
      ["db_003", "tnsping_result_time"]
    ]
]

ESB Indicators¶

Analysis: From the 4 line charts, we observe that soon after the failure time (04:28:00) the success rate decreased drastically twice in succession, with the first occurrence taking longer to recover than the second. This trend is a result of fewer successful calls (its graph shares a similar pattern with the success rate) relative to num. On the other hand, the average time to process the service requests fluctuated after the failure time, dipping and then increasing sharply.

The above pattern repeated itself once more soon after the system recovered.

From these indicators, we suspect that a fault in one of the microservices led to a faster-than-before return of the service request, in the form of a failed request. This is supported by the shorter average time accompanied by the lower success rate. On the other hand, the peak in average time can be explained as the "cold start" of connecting to a microservice that had just recovered. Nonetheless, these are speculations that need to be confirmed through other indicators.

Host Indicators¶

Analysis: Given that On_Off_State and tnsping_result_time are the host KPIs to observe for failure #4, we plot the following graph - comparing On_Off_State and tnsping_result_time values across 60 mins (corresponding to 60 data points) and against the train data for the same time period.

We observe that On_Off_State was toggled from 1 (indicating "on") to 0 (indicating "off") during the failure time. At the same time, tnsping_result_time increased sharply to 9,999 ms (which is likely the timeout duration). These indicate that the database was "switched off" / unavailable during the time of failure and it led to timeout errors when attempting to connect to the affected database.

In conclusion, we note that a possible failure could involve a database that the services require. Such failures would affect the service's success rate, and affect its latency while the distributed system attempts to fix the problem (e.g. by rerouting to another database). To identify these problems in production, we can observe for any drastic / abnormal change in the database's On_Off_State and tnsping_result_time metrics. Specifically, we can watch for the time periods when On_Off_State == 0 and tnsping_result_time == 9999.
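The detection rule above translates directly into a filter. A minimal sketch, assuming the host table columns shown earlier (cmdb_id, name, timestamp, value):

```python
import pandas as pd

def db_outage_rows(host: pd.DataFrame, cmdb_id: str = "db_003") -> pd.DataFrame:
    # Rows indicating a database outage: On_Off_State toggled to 0,
    # or tnsping_result_time hitting the 9999 timeout value.
    db = host[host["cmdb_id"] == cmdb_id]
    mask = ((db["name"] == "On_Off_State") & (db["value"] == 0)) | (
        (db["name"] == "tnsping_result_time") & (db["value"] >= 9999)
    )
    return db[mask].sort_values("timestamp")
```

The timestamps of the returned rows bound the outage window, which can then be cross-checked against the ESB success-rate drops.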

Trace Indicators¶

Analysis: Since we know that db_003 has a failure and that every traceId begins from a root process on either os_021 or os_022 (refer to Data Exploration's Finding #6), we will plot the following graphs: (1) comparing the length of trace data, (2) comparing the elapsed time of the parent processes for os_021, and (3) comparing the elapsed time of the parent processes for os_022

The time periods for these graphs are carefully chosen to ensure that long-elapsed-time calls are shown and that there is ample data before and after to ascertain the trend.

From the graphs, we can tell that:

  1. Number of trace invocations (trace length) experienced sharp decreases from 04:28:00 to 04:36:00
  2. os_021 experienced (1) lower elapsed time from 04:28:00 to 04:37:00, (2) higher elapsed time from 04:37:00 and 04:45:00
  3. os_022 experienced (1) lower elapsed time from 04:28:00 to 04:37:00, (2) a spike in elapsed time around 04:29:00 to 04:30:00, (3) higher elapsed time around 04:37:00 and 04:45:00

Observation 1 corresponds closely with the os_021 graph, with matching fluctuations.

Similar to Failure #2's Trace Indicators Explanation, we will try to compare specific databases

db_003 appears to be mapped with local_method_011 to local_method_017 (excluding local_method_012), therefore mapped with docker_005 to docker_008. db_009 appears to be mapped with local_method_001 to local_method_010 (excluding local_method_008), therefore mapped with docker_001 to docker_004.

Therefore, this will shape the logic subsequently, when we compare the docker metrics against one another to detect anomalies.

Subsequently, we sampled 100 traceIds (a sufficiently large sample for approximation) from both the train and test data to:

  1. Calculate the actual time taken for each process ID, combining them through a unique identifier (in the format cmdb_id:serviceName:dsName)
  2. Find the average time taken across all 100 traceIds for each unique identifier
  3. Merge the train and test dataframes to compute their differences (where possible)
  4. Filter for docker_006, docker_007, docker_008 in the unique_identifier (since they share similar metrics and trace calls)
  5. Plot and compare their values to observe for anomalies

Note that docker_006 is from train data, therefore it is faultless, while docker_007 and docker_008 are from test data.

From the output graph of the above step, we notice that:

  1. If db_003 has a fault, the invocation trace length will be shorter (2 instead of 11). This means that the invocation stopped early and possibly an error message was returned.
  2. If db_003 has a fault, there would be a large difference in actual time for None:db_003. This can be understood as the time to query the database being longer than the timeout duration.

What this means to the developer: Code review should be performed to understand the cause for the database failure and if the aftermath effects are intended and errors are properly handled.

What this means to site reliability engineer / ML engineer: We can potentially detect database failure by observing the None:db_003 metrics, for shorter actual time. Additionally, we can use the trace invocation length as another indicator for failures.
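The trace-invocation-length heuristic above can be sketched as follows; this is a sketch, not a production detector, and expected_length=11 reflects the normal trace length observed above:

```python
import pandas as pd

def short_traces(trace: pd.DataFrame, expected_length: int = 11) -> pd.Series:
    # Span count per traceId; traces truncated by a database fault
    # drop well below the expected invocation length (e.g. 2 instead of 11).
    lengths = trace.groupby("traceId")["id"].count()
    return lengths[lengths < expected_length]
```

A sustained cluster of short traces in a time window would then corroborate the None:db_003 actual-time signal.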

Finding the root cause for the slower processing speed

unique_identifier train_actual_time test_actual_time difference
46 os_021:osb_001:nan 14.33 17.23 -2.90
18 docker_004:csf_002:nan 3.38 4.44 -1.06
1 docker_003:csf_002:nan 4.00 4.20 -0.20
22 docker_004:fly_remote_001:nan 3.03 2.10 0.93
5 docker_003:fly_remote_001:nan 3.56 2.59 0.97
45 os_021:csf_001:nan 8.49 2.79 5.70
0 docker_003:csf_001:nan 16.78 8.36 8.42
17 docker_004:csf_001:nan 16.77 6.96 9.81
2 docker_003:csf_003:nan 2.24 NaN NaN
3 docker_003:csf_004:nan 2.63 NaN NaN
4 docker_003:csf_005:nan 3.00 NaN NaN
6 docker_003:local_method_001:db_009 0.44 NaN NaN
7 docker_003:local_method_002:db_009 2.10 NaN NaN
8 docker_003:local_method_003:db_009 1.23 NaN NaN
9 docker_003:local_method_004:db_009 5.06 NaN NaN
10 docker_003:local_method_005:db_009 0.70 NaN NaN
11 docker_003:local_method_006:db_009 2.39 NaN NaN
12 docker_003:local_method_007:db_009 2.26 NaN NaN
13 docker_003:local_method_009:db_009 1.43 NaN NaN
14 docker_003:local_method_010:db_007 2.30 NaN NaN
15 docker_003:nan:db_007 4.18 NaN NaN
16 docker_003:nan:db_009 5.73 NaN NaN
19 docker_004:csf_003:nan 3.57 NaN NaN
20 docker_004:csf_004:nan 4.48 NaN NaN
21 docker_004:csf_005:nan 3.42 NaN NaN
23 docker_004:local_method_001:db_009 0.51 NaN NaN
24 docker_004:local_method_002:db_009 2.47 NaN NaN
25 docker_004:local_method_003:db_009 0.64 NaN NaN
26 docker_004:local_method_004:db_009 3.84 NaN NaN
27 docker_004:local_method_005:db_009 1.87 NaN NaN
28 docker_004:local_method_006:db_009 2.97 NaN NaN
29 docker_004:local_method_007:db_009 2.48 NaN NaN
30 docker_004:local_method_009:db_009 1.28 NaN NaN
31 docker_004:local_method_010:db_007 2.66 NaN NaN
32 docker_004:nan:db_007 4.76 NaN NaN
33 docker_004:nan:db_009 7.70 NaN NaN
34 docker_006:csf_002:nan 26.60 NaN NaN
35 docker_006:csf_003:nan 14.76 NaN NaN
36 docker_006:csf_004:nan 14.70 NaN NaN
37 docker_006:csf_005:nan 12.31 NaN NaN
38 docker_006:local_method_011:db_003 2.26 NaN NaN
39 docker_006:local_method_013:db_003 3.58 NaN NaN
40 docker_006:local_method_014:db_003 1.93 NaN NaN
41 docker_006:local_method_015:db_003 3.99 NaN NaN
42 docker_006:local_method_016:db_003 2.07 NaN NaN
43 docker_006:local_method_017:db_003 3.08 NaN NaN
44 docker_006:nan:db_003 26.82 NaN NaN
47 docker_007:csf_002:nan NaN 11.91 NaN
48 docker_007:local_method_011:db_003 NaN 2.56 NaN
49 docker_008:csf_002:nan NaN 10.19 NaN
50 docker_008:local_method_011:db_003 NaN 2.26 NaN

Breaking Down Failure #5¶

Failure #5 Details:

[
    28800, 
    [[
        "docker_005", 
        null
    ]]
]

ESB Indicators¶

Analysis: From the 4 line charts, we observe that although the success rate did not change (subplot #4 remained constant at 1), the average time taken to process each service call increased sharply and the number of service requests handled decreased around the failure time period. This unusual behaviour can be noted by comparing against itself across 1 hour, as well as comparing against the train data (normal data with no failures) across the same time period. This indicates that failure #5 does not result in failed service calls, but could lead to other side effects such as longer call times and fewer service requests handled. This phenomenon could potentially be due to the app's error handling or simply due to poorer load balancing.

Host Indicators¶

FYI: From cmdb.xlsx, docker_005 is also mapped to os_017. However, we will only check the host data in dcos_docker.csv

Analysis: Trying out all possible KPI names for docker_005, we note the following observations:

  1. container_cpu_used -> Although there are fluctuations compared to the train data, they do not appear anomalous when compared with other dockers (e.g. docker_007) in the test data
  2. container_mem_used -> Although there are fluctuations compared to the train data, they do not appear anomalous when compared with other dockers (e.g. docker_007) in the test data
  3. container_fgc -> All values are 0
  4. container_fgct -> All values are 0
  5. container_session_used -> All values are 0
  6. container_thread_idle -> All values are 0
  7. container_thread_running -> All values are 0
  8. container_thread_total -> All values are 0
  9. container_thread_used_pct -> All values are 0

Trace Indicators¶

Analysis: Since we know that docker_005 has a failure and that every traceId begins from a root process on either os_021 or os_022 (refer to Data Exploration's Finding #6), we will plot the following graphs: (1) comparing the length of trace data, (2) comparing the elapsed time of the parent processes for os_021, and (3) comparing the elapsed time of the parent processes for os_022

The time periods for these graphs are carefully chosen to ensure that long-elapsed-time calls are shown and that there is ample data before and after to ascertain the trend.

From the graphs, we can tell that:

  1. Number of trace invocations (trace length) remained roughly the same despite failure
  2. os_021 experienced (1) reduction in service call frequency, (2) lower elapsed time
  3. os_022 experienced significantly higher elapsed time

Similar to Failure #2's Trace Indicators Explanation, we will compare specific dockers

docker_005 is more similar to docker_006 to docker_008. This will shape the logic subsequently, when we compare the docker metrics against one another to detect anomalies.

Subsequently, we sampled 100 traceIds (a sufficiently large sample for approximation) from both the train and test data to:

  1. Calculate the actual time taken for each process ID, combining them through a unique identifier (in the format cmdb_id:serviceName:dsName)
  2. Find the average time taken across all 100 traceIds for each unique identifier
  3. Merge the train and test dataframes to compute their differences (where possible)
  4. Filter for docker_005, docker_006 in the unique_identifier (docker_007 and docker_008 are less relevant than docker_006, which is the non-faulty docker in similar circumstances), since they share similar metrics and trace calls
  5. Plot and compare their values to observe for anomalies

From the output graph of the above step, we notice that:

  1. None:db_003 and csf_002:None have the largest differences between docker_005 and docker_006

Therefore, since db_003 and csf_002 are outer spans, we can conclude that the failure of docker_005 can be attributed to the large increase in time taken to pass control to the other microservices called by csf_002 and to the database.

What this means to the developer: Code review should be performed on docker_005:csf_002:None to understand the cause for the increase in time taken to call other microservices.

What this means to site reliability engineer / ML engineer: We can potentially detect docker failure (for docker_005 to docker_008) by observing the db_003 and csf_002:None metrics, for large increase in actual time.

Finding the root cause for the slower processing speed

unique_identifier train_actual_time test_actual_time difference
4 docker_001:csf_005:nan 6.57 842.92 -836.35
2 docker_001:csf_003:nan 7.21 795.59 -788.38
21 docker_002:csf_005:nan 3.63 786.65 -783.02
1 docker_001:csf_002:nan 8.32 782.03 -773.71
0 docker_001:csf_001:nan 27.15 789.79 -762.64
3 docker_001:csf_004:nan 7.90 709.82 -701.92
18 docker_002:csf_002:nan 6.26 706.31 -700.05
20 docker_002:csf_004:nan 6.13 682.07 -675.94
19 docker_002:csf_003:nan 5.17 671.45 -666.28
17 docker_002:csf_001:nan 27.61 471.95 -444.34
57 os_022:osb_001:nan 15.60 110.88 -95.28
56 os_022:csf_001:nan 9.58 35.59 -26.01
22 docker_002:fly_remote_001:nan 2.92 8.16 -5.24
5 docker_001:fly_remote_001:nan 2.29 5.73 -3.44
24 docker_002:local_method_002:db_009 1.80 3.44 -1.64
28 docker_002:local_method_006:db_009 2.08 3.46 -1.38
32 docker_002:nan:db_007 3.18 4.47 -1.29
12 docker_001:local_method_007:db_009 1.47 2.10 -0.63
15 docker_001:nan:db_007 2.93 3.44 -0.51
27 docker_002:local_method_005:db_009 2.11 2.20 -0.09
29 docker_002:local_method_007:db_009 0.64 0.58 0.06
13 docker_001:local_method_009:db_009 0.91 0.84 0.07
10 docker_001:local_method_005:db_009 0.85 0.57 0.28
31 docker_002:local_method_010:db_007 1.83 1.53 0.30
25 docker_002:local_method_003:db_009 0.82 0.50 0.32
7 docker_001:local_method_002:db_009 4.00 3.17 0.83
8 docker_001:local_method_003:db_009 1.83 0.48 1.35
6 docker_001:local_method_001:db_009 1.76 0.40 1.36
14 docker_001:local_method_010:db_007 4.61 3.18 1.43
30 docker_002:local_method_009:db_009 3.11 1.33 1.78
16 docker_001:nan:db_009 11.42 9.45 1.97
23 docker_002:local_method_001:db_009 2.61 0.37 2.24
33 docker_002:nan:db_009 9.51 6.86 2.65
11 docker_001:local_method_006:db_009 4.65 1.05 3.60
26 docker_002:local_method_004:db_009 6.30 1.88 4.42
9 docker_001:local_method_004:db_009 9.00 4.16 4.84
34 docker_007:csf_002:nan 12.29 NaN NaN
35 docker_007:csf_003:nan 8.82 NaN NaN
36 docker_007:csf_004:nan 8.27 NaN NaN
37 docker_007:csf_005:nan 6.36 NaN NaN
38 docker_007:local_method_011:db_003 0.96 NaN NaN
39 docker_007:local_method_013:db_003 1.69 NaN NaN
40 docker_007:local_method_014:db_003 0.94 NaN NaN
41 docker_007:local_method_015:db_003 1.68 NaN NaN
42 docker_007:local_method_016:db_003 0.95 NaN NaN
43 docker_007:local_method_017:db_003 1.12 NaN NaN
44 docker_007:nan:db_003 14.30 NaN NaN
45 docker_008:csf_002:nan 14.21 NaN NaN
46 docker_008:csf_003:nan 5.08 NaN NaN
47 docker_008:csf_004:nan 5.12 NaN NaN
48 docker_008:csf_005:nan 5.91 NaN NaN
49 docker_008:local_method_011:db_003 1.07 NaN NaN
50 docker_008:local_method_013:db_003 1.04 NaN NaN
51 docker_008:local_method_014:db_003 0.71 NaN NaN
52 docker_008:local_method_015:db_003 0.91 NaN NaN
53 docker_008:local_method_016:db_003 0.63 NaN NaN
54 docker_008:local_method_017:db_003 1.36 NaN NaN
55 docker_008:nan:db_003 11.54 NaN NaN
58 docker_005:csf_002:nan NaN 14.72 NaN
59 docker_005:csf_003:nan NaN 6.78 NaN
60 docker_005:csf_004:nan NaN 6.85 NaN
61 docker_005:csf_005:nan NaN 6.86 NaN
62 docker_005:local_method_011:db_003 NaN 0.96 NaN
63 docker_005:local_method_013:db_003 NaN 1.09 NaN
64 docker_005:local_method_014:db_003 NaN 0.70 NaN
65 docker_005:local_method_015:db_003 NaN 1.13 NaN
66 docker_005:local_method_016:db_003 NaN 0.73 NaN
67 docker_005:local_method_017:db_003 NaN 1.19 NaN
68 docker_005:nan:db_003 NaN 16.03 NaN
69 docker_006:csf_002:nan NaN 9.88 NaN
70 docker_006:csf_003:nan NaN 4.49 NaN
71 docker_006:csf_004:nan NaN 5.17 NaN
72 docker_006:csf_005:nan NaN 3.23 NaN
73 docker_006:local_method_011:db_003 NaN 0.79 NaN
74 docker_006:local_method_013:db_003 NaN 0.77 NaN
75 docker_006:local_method_014:db_003 NaN 0.52 NaN
76 docker_006:local_method_015:db_003 NaN 0.80 NaN
77 docker_006:local_method_016:db_003 NaN 0.46 NaN
78 docker_006:local_method_017:db_003 NaN 0.58 NaN
79 docker_006:nan:db_003 NaN 9.05 NaN